fit_1 <- stan_glm(kid_score ~ mom_hs, data = kidiq)

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
TSwD: Telling Stories with Data (Alexander)
ROS: Regression and Other Stories (Gelman, Hill, and Vehtari)
Last week we covered simple linear regression with one predictor:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
This week we extend to multiple predictors:
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i\]
Key Advantage
Adding more explanatory variables allows associations between the outcome and predictor of interest to be assessed while adjusting for other explanatory variables.
The results can be quite different from separate simple regressions, especially when explanatory variables are correlated with each other.
We’ll use data predicting children’s cognitive test scores from:
- mom_hs: mother completed high school (0 or 1)
- mom_iq: mother's IQ score (continuous)

Let's build up from simple to multiple regression.
Example from ROS Chapter 10
Fitted model: \[\text{kid\_score} = 78 + 12 \times \text{mom\_hs} + \text{error}\]
Interpretation
What this tells us
The coefficient represents the difference in means between the two groups.
Fitted model: \[\text{kid\_score} = 26 + 0.6 \times \text{mom\_iq} + \text{error}\]
Interpretation
Comparing children whose mothers differ by 1 point in IQ, we expect the child’s test score to differ by 0.6 points on average.
A 10-point difference in mothers’ IQs → 6-point difference in children’s scores.
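These models, including the multiple regression whose output follows, can be fit with rstanarm. A minimal sketch, assuming the kidiq data from ROS has been loaded; the object names fit_2 and fit_3 are illustrative:

```r
library(rstanarm)

# Simple regression: maternal IQ only
fit_2 <- stan_glm(kid_score ~ mom_iq, data = kidiq)

# Multiple regression: maternal education and maternal IQ together
fit_3 <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq)
print(fit_3)
```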
Output:
Median MAD_SD
(Intercept) 25.7 5.9
mom_hs 6.0 2.4
mom_iq 0.6 0.1
sigma 18.2 0.6
Fitted model: \[\text{kid\_score} = 26 + 6 \times \text{mom\_hs} + 0.6 \times \text{mom\_iq} + \text{error}\]
Child’s test score vs maternal IQ, with separate groups for maternal education
Coefficient of mom_hs (6.0)
Comparing children whose mothers have the same IQ but differ in high school completion, the child whose mother completed high school is predicted to score 6 points higher.
Coefficient of mom_iq (0.6)
Comparing children with the same maternal education whose mothers differ by 1 IQ point, the child's test score is predicted to differ by 0.6 points.
Key Insight
Each coefficient represents the effect of that variable holding the other variables constant.
Definition
The coefficient \(\beta_k\) is the average or expected difference in outcome \(y\), comparing two observations that differ by one unit in predictor \(x_k\) while being equal in all other predictors.
This is sometimes called:
Predictive Interpretation
How does the outcome differ, on average, when comparing two groups that differ by 1 in the predictor?
“Children whose mothers have 1 point higher IQ tend to score 0.6 points higher”
Counterfactual Interpretation
What would happen if we changed the predictor by 1 unit?
“Increasing maternal IQ by 1 point would increase the child’s score by 0.6 points”
Caution
The counterfactual interpretation requires causal assumptions that may not be justified from observational data alone.
The most careful way to interpret regression coefficients:
“When comparing two children whose mothers have the same level of education, the child whose mother is x IQ points higher is predicted to have a test score that is \(0.6x\) higher, on average.”
This is awkward but accurate! It avoids implying causation when we only have correlation.
When a predictor has only two categories, we create an indicator (or dummy) variable:
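The output below can be produced along these lines. A sketch, assuming the ROS earnings data, where weight is in pounds, male is a 0/1 indicator, and c_height is height centred at its mean:

```r
library(rstanarm)

# Centre height so the intercept is interpretable (predicted weight
# for a woman of average height)
earnings$c_height <- earnings$height - mean(earnings$height)

fit <- stan_glm(weight ~ c_height + male, data = earnings)
```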
Median MAD_SD
(Intercept) 149.6 1.0
c_height 3.9 0.3
male 12.0 2.0
sigma 28.8 0.5
Interpretation of male coefficient
Comparing a man to a woman of the same height, the man is predicted to be 12 pounds heavier.
When a predictor has more than two categories, R automatically creates indicator variables:
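A sketch of the fit behind the output below, assuming the same earnings data with an ethnicity variable:

```r
# factor(ethnicity) tells R to create one indicator variable per
# category, omitting one category as the baseline
fit <- stan_glm(weight ~ c_height + male + factor(ethnicity),
                data = earnings)
```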
Output:
Median MAD_SD
(Intercept) 154.1 2.2
c_height 3.8 0.3
male 12.2 2.0
factor(ethnicity)Hispanic -5.9 3.6
factor(ethnicity)Other -12.6 5.2
factor(ethnicity)White -5.0 2.3
What happened to “Black”?
R uses one category as the baseline (reference group). All other coefficients are interpreted relative to this baseline.
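To choose the baseline yourself, re-level the factor before fitting. A sketch in which White is set as the reference group; the variable name eth is illustrative:

```r
# Make White the reference category, then refit the model
earnings$eth <- relevel(factor(earnings$ethnicity), ref = "White")
fit <- stan_glm(weight ~ c_height + male + eth, data = earnings)
```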
Now all coefficients are interpreted relative to White:
Median MAD_SD
(Intercept) 149.1 1.0
ethBlack 5.0 2.2
ethHispanic -0.9 2.9
ethOther -7.6 4.6
Sometimes the effect of one variable depends on the value of another variable.
Example: The relationship between maternal IQ and child’s test score might be different for children whose mothers completed high school vs those who didn’t.
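This possibility can be checked by including an interaction term. A sketch with the kidiq data; fit_4 is an illustrative name:

```r
# mom_hs * mom_iq expands to mom_hs + mom_iq + mom_hs:mom_iq
fit_4 <- stan_glm(kid_score ~ mom_hs * mom_iq, data = kidiq)
```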
Fitted model: \[\text{kid\_score} = -11 + 51 \times \text{mom\_hs} + 1.1 \times \text{mom\_iq} - 0.5 \times \text{mom\_hs} \times \text{mom\_iq}\]
Interaction between maternal education and IQ
The slopes are different for each group!
For mom_hs = 0 (no high school)
\[\text{kid\_score} = -11 + 1.1 \times \text{mom\_iq}\]
Slope = 1.1
For mom_hs = 1 (completed high school)
\[\text{kid\_score} = 40 + 0.6 \times \text{mom\_iq}\]
Slope = 0.6
The interaction coefficient (-0.5) is the difference in slopes between the two groups.
The * operator automatically includes:

- x1
- x2
- x1:x2

When to Look for Interactions
Start with predictors that have large coefficients when not interacted. If a variable has a big effect overall, its effect may also differ across subgroups.
This model includes the main effects of mom_hs and mom_iq as well as their interaction.
Example from TSwD Chapter 12
After fitting a regression \(y = a + bx + \text{error}\), we can make three predictions:
ROS Chapter 9
predict()

This gives a single number — our best guess for the outcome.

posterior_linpred()

This returns a vector of simulations representing uncertainty in the expected value of the outcome (the linear predictor), driven by uncertainty in the coefficients.

posterior_predict()

This includes both uncertainty in the coefficients and residual variation \(\sigma\), giving the full uncertainty about a new observation.
Key Difference
Even with infinite data (perfect coefficient estimates), predictive uncertainty never goes to zero because of residual variation \(\sigma\).
Advantage of Simulation
We can compute any function of the predictions, including nonlinear summaries like “probability of winning”.
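The three functions can be compared on a fitted model. A sketch, assuming a fitted rstanarm model fit_3 (kid_score on mom_hs and mom_iq) and an illustrative new observation:

```r
new <- data.frame(mom_hs = 1, mom_iq = 100)

point    <- predict(fit_3, newdata = new)            # single best guess
linpred  <- posterior_linpred(fit_3, newdata = new)  # simulations of the expected value
postpred <- posterior_predict(fit_3, newdata = new)  # adds residual variation

# Predictive uncertainty is wider than uncertainty in the expected value
sd(postpred) > sd(linpred)

# Any function of the simulations is available, e.g. Pr(score > 100)
mean(postpred > 100)
```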
Frequentist: lm() in R

Bayesian: stan_glm() in R

TSwD Chapter 12
\[y_i | \mu_i, \sigma \sim \text{Normal}(\mu_i, \sigma)\] \[\mu_i = \beta_0 + \beta_1 x_i\] \[\beta_0 \sim \text{Normal}(0, 2.5)\] \[\beta_1 \sim \text{Normal}(0, 2.5)\] \[\sigma \sim \text{Exponential}(1)\]
Priors
The prior distributions express our beliefs about the parameters before seeing the data. The data then updates these to produce posterior distributions.
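These priors can be set explicitly in rstanarm via the prior, prior_intercept, and prior_aux arguments; a sketch with an illustrative formula:

```r
fit <- stan_glm(kid_score ~ mom_iq, data = kidiq,
                prior_intercept = normal(0, 2.5),  # beta_0
                prior           = normal(0, 2.5),  # beta_1
                prior_aux       = exponential(1))  # sigma
```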
What do the priors imply?
Before running the model, simulate from the priors to check if they produce sensible predictions.
Prior predictive check showing implied predictions
If priors imply impossible values (e.g., negative marathon times), refine your priors!
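One way to simulate from the priors in rstanarm is to refit the model with prior_PD = TRUE, which draws from the prior predictive distribution while ignoring the data:

```r
# Draws come from the priors, not the posterior
fit_prior <- stan_glm(kid_score ~ mom_iq, data = kidiq,
                      prior_PD = TRUE)

pp_check(fit_prior)  # do the implied outcomes look plausible?
```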
The autoscale = TRUE option adjusts priors based on the data scale.
Check what was used with prior_summary(model).

If the model fits well, simulated data should look like actual data.
Trace Plot
Look for: horizontal, overlapping chains
Rhat Plot
Look for: all values close to 1.0 (< 1.1)
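These checks can be produced from a fitted rstanarm model; a sketch assuming the bayesplot package and a fitted model fit:

```r
library(bayesplot)

pp_check(fit)           # posterior predictive check: simulated vs observed data
plot(fit, "trace")      # trace plot: look for horizontal, overlapping chains
mcmc_rhat(rhat(fit))    # Rhat values: all should be close to 1.0
```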
modelsummary() for Comparison

This creates a publication-quality table comparing multiple models side by side.
| Metric | Interpretation |
|---|---|
| R² | Proportion of variance explained |
| Adjusted R² | R² penalised for number of predictors |
| AIC/BIC | Information criteria (lower is better) |
| RMSE | Root mean squared error |
| LOOIC | Leave-one-out cross-validation IC |
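A sketch of such a comparison, assuming the modelsummary package and two fitted models from earlier (the labels are illustrative):

```r
library(modelsummary)

# Side-by-side table of coefficients and fit statistics
modelsummary(list("Simple" = fit_1, "Multiple" = fit_3))
```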
Which to use?
When using an unfamiliar dataset, check for:
Most Important
Is the model directly relevant to your research question?
A confounder is a variable that:

- is associated with the predictor of interest, and
- also affects the outcome.
Including confounders in the model can change the estimated effect of your predictor of interest.
Simple regression
With confounder
The effect of education decreases when we control for parental income — part of the apparent education effect was actually due to family background.
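As a sketch, with hypothetical variable and dataset names (income, education, parent_income, and survey are illustrative, not from the lecture data):

```r
# Simple regression: apparent association of education with income
fit_simple <- stan_glm(income ~ education, data = survey)

# With confounder: the education coefficient typically shrinks
# once parental income is adjusted for
fit_adjusted <- stan_glm(income ~ education + parent_income,
                         data = survey)
```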
Include variables that:

- help predict the outcome
- may confound the association of interest

Don't include:

- variables that are consequences of the predictor of interest (post-treatment variables)
- variables that merely add noise to the estimates
Statistical Concepts
Bayesian Additions
# Frequentist
lm(y ~ x1 + x2 + x3, data = df)
lm(y ~ x1 * x2, data = df) # With interaction
# Bayesian
stan_glm(y ~ x1 + x2, data = df,
prior = normal(0, 2.5))
# Predictions
predict(fit, newdata = new)
posterior_linpred(fit, newdata = new)
posterior_predict(fit, newdata = new)
# Diagnostics
pp_check(fit)
prior_summary(fit)

| Formula | Meaning |
|---|---|
| y ~ x1 + x2 | Main effects only |
| y ~ x1:x2 | Interaction only |
| y ~ x1 * x2 | Main effects + interaction |
| y ~ factor(x) | Categorical with dummies |
| y ~ x1 + I(x1^2) | Polynomial term |
Week 8: Model Diagnostics and Communication
Readings: